12 research outputs found

    An Effective and Efficient Graph Representation Learning Approach for Big Graphs

    In the Big Data era, large graph datasets are becoming increasingly popular due to their capability to integrate and interconnect large sources of data in many fields, e.g., social media, biology, and communication networks. Graph representation learning is a flexible tool that automatically extracts features for the nodes of a graph. These features can be directly used for machine learning tasks. Producing features that preserve the structural information of a graph is still an open problem, especially in the context of large-scale graphs. In this paper, we propose a new fast and scalable structural representation learning approach called SparseStruct. Our approach uses a sparse internal representation for each node, and we formally prove its ability to preserve structural information. Thanks to a lightweight algorithm in which each iteration costs only linear time in the number of edges, SparseStruct can easily process large graphs. In addition, it improves on the state of the art in prediction and classification accuracy while also providing strong robustness to noisy data.

    Unsupervised Structural Graph Node Representation Learning

    Unsupervised graph representation learning methods learn a numerical representation of the nodes in a graph. The generated representations encode meaningful information about the nodes' properties, making them a powerful tool for tasks in many areas of study, such as social sciences, biology, or communication networks. These methods are particularly interesting because they facilitate the direct use of standard machine learning models on graphs. Graph representation learning methods can be divided into two main categories depending on the information they encode: methods preserving the nodes' connectivity information, and methods preserving the nodes' structural information. Connectivity-based methods focus on encoding relationships between nodes, with neighboring nodes being closer together in the resulting latent space. Structure-based methods, on the other hand, generate a latent space where nodes serving a similar structural function in the network are encoded close to each other, regardless of whether they are connected or even close to each other in the graph. While many works focus on preserving nodes' connectivity information, only a few study the problem of encoding nodes' structure, especially in an unsupervised way. In this dissertation, we demonstrate that properly encoding nodes' structural information is fundamental for many real-world applications, as it can be leveraged to successfully solve many tasks where connectivity-based methods fail. We first present one concrete example, in which the task is to detect malicious entities in a real-world financial network. We show that connectivity information is not enough to solve this problem and that leveraging structural information provides considerable performance improvements.
This example highlights the need for further research in the area of structural graph representation learning, together with the limitations of the previous state of the art. We use the acquired knowledge as a starting point and inspiration for the research and development of three independent unsupervised structural graph representation learning methods: Structural Iterative Representation learning approach for Graph Nodes (SIR-GN), Structural Iterative Lexicographic Autoencoded Node Representation (SILA), and Sparse Structural Node Representation (SparseStruct). We show how each of our methods tackles specific limitations of the previous state of the art in structural graph representation learning, such as scalability, representation meaning, and the lack of a formal proof that guarantees the preservation of structural properties. We provide an extensive experimental section where we compare our three proposed methods to the current state of the art among both connectivity-based and structure-based representation learning methods. Finally, we look at extensions of the basic structural graph representation learning problem: we study the problem of temporal structural graph representation, and we provide a method for representation explainability.

    SIR-GN: A Fast Structural Iterative Representation Learning Approach for Graph Nodes

    Graph representation learning methods have attracted an increasing amount of attention in recent years. These methods focus on learning a numerical representation of the nodes in a graph. Learning these representations is a powerful instrument for tasks such as graph mining, visualization, and hashing. They are of particular interest because they facilitate the direct use of standard machine learning models on graphs. Graph representation learning methods can be divided into two main categories: methods preserving the connectivity information of the nodes and methods preserving nodes’ structural information. Connectivity-based methods focus on encoding relationships between nodes, with connected nodes being closer together in the resulting latent space. Methods preserving structure, in contrast, generate a latent space where nodes serving a similar structural function in the network are encoded close to each other, regardless of whether they are connected or even close to each other in the graph. While many works focus on preserving node connectivity, only a few focus on preserving nodes’ structure. Properly encoding nodes’ structural information is fundamental for many real-world applications, as it has been demonstrated that this information can be leveraged to successfully solve many tasks where connectivity-based methods usually fail. A typical example is the task of node classification, i.e., the assignment or prediction of a particular label for a node. Current limitations of structural representation methods are their scalability, their representation meaning, and the lack of a formal proof guaranteeing the preservation of structural properties. We propose a new graph representation learning method, called Structural Iterative Representation learning approach for Graph Nodes (SIR-GN).
In this work, we propose two variations (SIR-GN: GMM and SIR-GN: K-Means) and show how our best variation, SIR-GN: K-Means, (1) theoretically guarantees the preservation of graph structural similarities, (2) provides a clear meaning for its representation and a way to interpret it with a specifically designed attribution procedure, and (3) is scalable and fast to compute. In addition, our experiments show that SIR-GN: K-Means is often better than, or in the worst case comparable to, the existing structural graph representation learning methods in the literature. We also empirically show its superior scalability and computational performance when compared to other existing approaches.
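
To make the structural notion used in these abstracts concrete, the following is a minimal Python sketch of a generic iterative neighborhood-refinement scheme (in the spirit of color refinement). It is not the SIR-GN algorithm, whose details the abstract does not give; the function name `structural_signatures`, the adjacency-dictionary input, and the toy graph are all assumptions made for illustration. It only shows how two nodes that are never connected can still receive the same structural label.

```python
from collections import Counter

def structural_signatures(adjacency, iterations=2):
    """Assign each node a label that depends only on its structural surroundings.

    Hypothetical helper for illustration; not the method from the abstract.
    """
    # Start from the node degree, a purely structural property.
    labels = {node: len(neighbors) for node, neighbors in adjacency.items()}
    for _ in range(iterations):
        # Describe each node by its own label plus the multiset of neighbor labels.
        # One pass visits every adjacency entry once.
        signatures = {
            node: (
                labels[node],
                tuple(sorted(Counter(labels[n] for n in neighbors).items())),
            )
            for node, neighbors in adjacency.items()
        }
        # Compress signatures to small integers so labels stay easy to compare.
        ids = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        labels = {node: ids[signatures[node]] for node in adjacency}
    return labels

# Two disconnected star graphs: hubs "a" and "x" are never connected,
# yet they play the same structural role, as do the leaves.
graph = {
    "a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"],
    "x": ["y", "z", "w"], "y": ["x"], "z": ["x"], "w": ["x"],
}
roles = structural_signatures(graph)
print(roles["a"] == roles["x"], roles["a"] == roles["b"])
```

A connectivity-based embedding would place "a" near its own leaves and far from "x"; a structural labeling like this one groups the two hubs together instead.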

    Detecting Suspicious Entities in Offshore Leaks Networks

    The ICIJ Offshore Leaks Database represents a large set of relationships between people, companies, and organizations involved in the creation of offshore companies in tax-haven territories, mainly for hiding their assets. These data are organized into four networks of entities and their interactions: Panama Papers, Paradise Papers, Offshore Leaks, and Bahamas Leaks. For instance, the entities involved in the Panama Papers network are people or companies that had affairs with the Panamanian offshore law firm Mossack Fonseca, often with the purpose of laundering money. In this paper, we address the problem of searching the ICIJ Offshore Leaks Database for people and companies that may be involved in illegal acts. We use a collection of international blacklists of sanctioned people and organizations as ground truth for bad entities. We propose a new ranking algorithm, named Suspiciousness Rank Back and Forth (SRBF), that, given one of the networks in the ICIJ Offshore Leaks Database, leverages the network structure and the blacklist ground truth to assign a degree of suspiciousness to each entity in the network. We experimentally show that our algorithm outperforms existing techniques for node classification, achieving an area under the ROC curve ranging from 0.69 to 0.85 and an area under the recall curve ranging from 0.70 to 0.84 on three of the four considered networks. Moreover, our algorithm retrieves bad entities earlier in the ranking than competitors. Further, we show the effectiveness of SRBF in a case study on the Panama Papers network.

    Evaluating the Impact of Social Media in Detecting Health-Violating Restaurants

    Detecting health-violating restaurants is a serious problem because a city has few health inspectors compared to its number of restaurants. Inspectors are rarely helped by formal complaints, but many complaints are reported as reviews on social media such as Yelp. In this paper, we propose new predictors to detect health-violating restaurants based on restaurant sub-area location, previous inspection history, Yelp review content, and Yelp user behavior. The resulting method outperforms past work, improving Cohen’s kappa and the Matthews correlation coefficient by at least 16%. In addition, we define a new method that directly evaluates the benefit of a classifier to an inspector’s ability to detect health-violating restaurants. We show that our classification method improves the inspector’s ability and outperforms previous solutions.
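
For readers unfamiliar with the two metrics named in this abstract, here is a small Python sketch of their standard definitions for a binary confusion matrix. The function names and the example counts are made up for illustration and are not taken from the paper.

```python
import math

def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Assumes both classes appear, so that chance agreement is below 1.
    """
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement from the marginals of predictions and true labels.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    return (observed - expected) / (1 - expected)

def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative counts only: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
kappa = cohen_kappa(40, 10, 5, 45)
mcc = matthews_corrcoef(40, 10, 5, 45)
print(round(kappa, 3), round(mcc, 3))
```

Both scores account for class imbalance, which plain accuracy does not, and that matters when violating restaurants are a minority.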

    Inferring Bad Entities Through the Panama Papers Network

    The Panama Papers represent a large set of relationships between people, companies, and organizations that had affairs with the Panamanian offshore law firm Mossack Fonseca, often for the purpose of money laundering. In this paper, we address for the first time the problem of searching the Panama Papers for people and companies that may be involved in illegal acts. We use a collection of international blacklists of sanctioned people and organizations as ground truth for bad entities. We propose a new ranking algorithm, named Suspiciousness Rank Back and Forth (SRBF), that leverages this ground truth to assign a degree of suspiciousness to each entity in the Panama Papers. We experimentally show that our algorithm achieves an AUROC of 0.85 and an Area Under the Recall Curve of 0.87 and outperforms existing techniques.
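
As a reminder of what the AUROC figure measures for a ranking method, the sketch below computes it from its pairwise definition: the fraction of (bad, non-bad) entity pairs in which the bad entity receives the higher score, counting ties as half. The function name and the toy scores are assumptions for illustration; this is not the SRBF algorithm and not the paper's data.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (quadratic time)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy suspiciousness scores: label 1 marks a blacklisted entity.
example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(example)  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every blacklisted entity above every other entity.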

    Large-Scale Sparse Structural Node Representation

    In the Big Data era, large graph datasets are becoming increasingly popular due to their capability to integrate and interconnect large sources of data in many fields, e.g., social media, biology, and communication networks. Graph representation learning is a flexible tool that automatically extracts features for the nodes of a graph. These features can be directly used for machine learning tasks. Producing features that preserve the structural information of a graph is still an open problem, especially in the context of large-scale graphs. In this paper, we propose a new fast and scalable structural representation learning approach called SparseStruct. Our approach uses a sparse internal representation for each node, and we formally prove its ability to preserve structural information. Thanks to a lightweight algorithm in which each iteration costs only linear time in the number of edges, SparseStruct can easily process large graphs. In addition, it improves on the state of the art in prediction and classification accuracy while also providing strong robustness to noisy data.

    RIBS: Risky Blind-Spots for Attack Classification Models

    The use of machine learning methods for cyber-security applications has increased in recent years. These methods can be prone to generalization, especially in a binary attack classification setting, where the objective is to differentiate between benign and malicious behavior. This generalization creates risky security blind spots that make the system vulnerable. Attackers are well aware of these blind spots and exploit them to bypass security measures and achieve their objectives. In this work, we propose a methodology to mitigate the problem, RIsky Blind-Spot (RIBS), by making the classification more robust. Our proposed approach creates a generator model that can learn the real characteristics of the data and, consequently, sample real examples targeting the blind spots of a classifier. We validate our methodology in the context of power grids, where we show how this framework can improve the detection of unknown malicious behavior. Our approach provides an improvement of 10% in terms of accuracy and detected attacks when compared to the baseline method.

    Unknown Landscape Identification with CNN Transfer Learning

    Unknown landscape identification is the problem of identifying an unknown landscape given a set of landscape images that are considered to be known. The aim of this work is to extract the intrinsic semantics of landscape images in order to automatically generalize concepts such as a stadium, roads, or a parking lot, and to use these concepts to identify unknown landscapes. This problem can be easily extended to many security applications. We propose two effective semi-supervised novelty detection approaches for the unknown landscape identification problem using Convolutional Neural Network (CNN) transfer learning. They are based on pre-trained CNNs (i.e., CNNs already trained on large datasets) that contain general image knowledge, which we transfer to our domain. Our best AUROC and Average Precision scores for the identification problem are 0.96 and 0.94, respectively. In addition, we statistically show that our semi-supervised methods outperform the baseline.

    A Multi-Perspective Approach for the Analysis of Complex Business Processes Behavior

    Business processes are often monitored by transactional information systems that produce massive datasets called event logs. Such logs contain the process execution traces, typically characterized by heterogeneous and high-dimensional data. Process mining techniques offer a great opportunity to extract valuable knowledge hidden in the data and use it to analyse the multiple characteristics of processes (i.e., the perspectives in process mining, such as structural aspects, activities, resources, data, and time). To do so, raw data must be encoded into a format that can be conveniently provided to the mining algorithms. However, most existing process encoding techniques focus on the control-flow perspective, i.e., they only encode the sequence of activities that characterize a trace, leaving out other process perspectives that are fundamental for describing the process behavior in all its aspects. In this paper, we address the problem of computing a concise and informative representation of execution traces that considers the multiple perspectives of the process behavior. We propose a holistic approach that computes a trace embedding able to capture patterns of dependencies between the perspectives that are lost in a one-dimensional analysis. The approach is also unsupervised, meaning that no a priori knowledge is needed. Experiments conducted on two real-life logs demonstrate that the proposed embedding concisely describes the multiple and varied characteristics of the processes, and that the proposed method outperforms existing trace encoding techniques. Furthermore, the embedding includes the elapsed time between events as an additional feature, so that it can be used as a further dimension of analysis.